Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tool-based annotation of out of memory errors. #5196

Merged
merged 1 commit into from Dec 27, 2017

Conversation

jmchilton
Copy link
Member

  • Update explicit stdio definitions as well as detect_errors-based definitions to allow this.
    • If detect_errors="exit_code" is set, an extra attribute oom_exit_code can be defined on the same command block.
    • If detect_errors='aggressive" is set, Galaxy will now look for newer strings indicating common memory problems. (Do we need to restrict this to just newer profile version tools for greater reproducibility or do we want to catch more errors on older tools?).
  • Update runner code to allow resubmission in such cases (and other tool errors).

Fixes #3320
Implements #3107

@AsierGonzalez
Copy link

Hi @jmchilton, this sounds incredibly interesting to me as we cannot use HISAT2 on collections with the wheat genome because more than 2 jobs in the same node make them crash silently. The jobs crash due to "out of memory" error but in the history they appear in green, so you do not realise this until downstream tools fail. Would your solution solve this problem?

@gregvonkuster
Copy link
Contributor

This is fantastic @jmchilton - it will be extremely beneficial to my work, and I'm looking forward to making use of it. Thanks!

@bernt-matthias
Copy link
Contributor

Unfortunately I can't say much to the changes of the code. But from the description and my understanding of the code I definitely think that this PR is a step in the right direction and can be really helpful.

My opinion is still that memory errors should be detected by the runner -- as far as possible (and for the other cases your changes are really useful). My reasoning is the following: like run time violations (which are already handled by the runner) also memory violations are not the tool's fault, but a fault of the runner (which requested to little resources or was just unable to ensure that sufficient resources are available). (Similarly also cluster errors that are mentioned in #3320 should be handled by the runner). A second line of reasoning is that the detection of memory errors should/must work for each and every tool. So if we let the tools decide if there is a memory violation one would need to enable the detection of memory violations in every tool. Because I guess (but maybe I'm wrong) that most tools don't check for memory problems, but use features of the programming language or just don't check at all, this would mean that most tools would need to check for the same error messages which means quite a bit of code duplication in the tool xml files.

Please note that if I write runner, then this seems the best place for me, but I'm not an expert. The important thing for me is that such problems should not be detected by single tools. So some of the job classes might also be a good place.

We have to distinguish programming language specific error messages (e.g. std::bad_alloc in the case of c++ programs) and program specific error messages (e.g. if the programs catch out-of-memory and writing a custom message to stderr/stdout or returning specific error codes).

Handling programming language specific error messages by the runner has the potential problem (as you mentioned earlier) that there might be tools where such an error does not indicate an error. The question with this is: does this affect a significant number of tools? It might be easier to implement an exception for these tools (e.g. by filtering the spurious messages from stdout/stderr) compared to adapting all tools to consider language specific error messages.

Clearly (as discussed earlier), tool specific memory error messages should be handled by the tool. For the detection of such messages your changes appear to go in this direction, but it seems that so far only exit codes can be used to detect OOM? Or is it possible to specify regular expressions for tool specific
memory error messages as well?

So overall my suggestion for this PR is to:

  • remove the programming language specific error messages from tool checks and move them to the runner (maybe BaseRunner?).
  • implement regular expression parsing in the tool xml to detect tool specific memory error messages

Maybe a bit of background on how I intend to implement job handling on our Galaxy instance (which uses a grid engine on a large cluster). Jobs are resubmitted with increasing run time starting with 15min and then 1h, 1day, 1week. A similar 2 stage escalation strategy is implemented for memory (we have normal nodes and high mem nodes). So with a few job destinations we cover all tools and input sizes -- without wasting to much run time. This of course relies on the detection of runtime and memory violations. All other strategies, eg. dynamic job destinations or specifying time/memory per tool seem impractical to me because they require to much time for administration or waste resources (the same tool started for bacteria and an eukaryote genome would get the same resources).

@jmchilton
Copy link
Member Author

@AsierGonzalez

This tool has custom stdio parsing - so it would need to be updated to catch this - the lines https://github.com/galaxyproject/tools-iuc/blob/master/tools/hisat2/hisat2.xml could become:

<stdio>
       <regex level="fatal" match="hisat2-align exited with value 1" source="both" />
       <regex level="fatal" match="hisat2: not found" source="both" />
       <exit_code range="1:" />
</stdio>
<stdio>
       <regex level="fatal" match="hisat2-align exited with value 1" source="both" />
       <regex level="fatal" match="hisat2: not found" source="both" />
       <regex level="fatal_oom" match="out of memory" source="both" />
       <exit_code range="1:" />
</stdio>

It is weird in general that the HiSat exit code is ignore though - that is a fix that should be made regardless of this change to Galaxy I suspect. I've created an issue on the IUC issue tracker galaxyproject/tools-iuc#1634.

@bernt-matthias
Copy link
Contributor

Great. Then the tool based errors seem to be covered by your PR. Still I would suggest that we move the language dependent errors (like std::bad_alloc) elsewhere.

@jmchilton
Copy link
Member Author

Great. Then the tool based errors seem to be covered by your PR.

Yeah - I rebased this with one small tweak that enabled this (I thought it already was) and then added some docs for it in galaxy.xsd.

Still I would suggest that we move the language dependent errors (like std::bad_alloc) elsewhere.

These ones are only used if detect_errors="aggressive" has been set on a tool. This flag already mixes common errors from different languages (to some extent at least) and the tool author has sort of declared they want to be really aggressive when searching the output for errors - so I think it is okay to combine languages here - the whole point of the flag I think is to find as many errors as possible. Does that make sense?

I've never really liked aggressive though - I encourage the use of exit_code instead. So I get the desire on some level to want to catch classes of errors based on the programming language I think - though I'm not totally sold. Do you have an idea for what the tool syntax might look like?

Would we make detect_errors a list:

<command detect_errors="exit_code,c_memory_errors">

Or maybe exit_code would be implicit some new detectors that were language based (e.g. c_errors, java_errors):

<command detect_errors="c_errors">

@bernt-matthias
Copy link
Contributor

You may leave it as it is.

I'm just looking for a way that a Galaxy admin can enable this behavior (checking for language specific memory errors) for all tools (regardless of the tool configurations).

Maybe job_conf?

@bgruening
Copy link
Member

If detect_errors='aggressive" is set, Galaxy will now look for newer strings indicating common memory problems. (Do we need to restrict this to just newer profile version tools for greater reproducibility or do we want to catch more errors on older tools?).

I think this is ok. This entire PR looks great to me.
@bernt-matthias I'm with you that this should be ideally handled by an admin with job_conf.xml, but this is a very nice step towards to help certain job schedules to make a good decision. So I don't see any drawbacks here.

@bernt-matthias
Copy link
Contributor

I'm completely with you. The PR is a great addition.

On job_conf.xml: Having a <stdio> tag in job_conf with the same functionality as the stdio tag in the tool xml files seems like a great idea to me.

@bgruening
Copy link
Member

@jmchilton can you please resolve the conflict. Thanks!

- Update explicit stdio definitions as well as detect_errors-based definitions to allow this.
- Update runner code to allow resubmission in such cases.
@bgruening
Copy link
Member

Thanks @jmchilton! I missed the rebase somehow.

What do you think about @bernt-matthias last comment, especially for older tools?

On job_conf.xml: Having a tag in job_conf with the same functionality as the stdio tag in the tool xml files seems like a great idea to me.

@bernt-matthias I see that this could be useful especially for older tools, but I would also encourage Admins to submit this upstream to the tools.

@bgruening bgruening merged commit c78a23c into galaxyproject:dev Dec 27, 2017
@bernt-matthias
Copy link
Contributor

Is there a way to document the new functionality at https://docs.galaxyproject.org/en/master/dev/schema.html?

@jmchilton
Copy link
Member Author

This should have done that. Our docs just stopped generating/publishing - the build time grew too large I think. I've increased the timeout and re-ran the last such Jenkins job. https://jenkins.galaxyproject.org/view/All/job/latest-Sphinx-Docs/

@nsoranzo
Copy link
Member

nsoranzo commented Feb 5, 2018

@jmchilton By looking at the code, it seems to me that StdioParser.parse_error_level() method in lib/galaxy/tools/parser/xml.py should also have been modified to return StdioErrorLevel.FATAL_OOM when err_level matches fatal_oom, or am I missing something?

jmchilton added a commit to jmchilton/galaxy that referenced this pull request Feb 5, 2018
Backend was enhanced to allow this in galaxyproject#5196 and a new entry point to setting exit code for such exit conditions but the custom parsing wasn't updated to allow it. Thanks to @nsoranzo for the catch.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants